Robust Multichannel Gender Classification from Speech in Movie Audio
Abstract
Speech in the form of scripted dialogue is an important part of the audio signal in movies. However, it is often masked by background audio such as music, ambient noise, or background chatter. These background sounds make even otherwise simple tasks, such as gender classification, challenging. Additionally, the variability of this noise across movies renders standard approaches to source separation or enhancement inadequate. Instead, we exploit the multichannel information present in the different language channels (English, Spanish, French) of each movie to improve the robustness of our gender classification system. We rely on the fact that the speaker labels of interest co-occur across the language channels. We fuse the predictions obtained for each channel using Recognition Output Voting Error Reduction (ROVER) and show that this approach improves gender classification accuracy by 7% absolute (11% relative) over the best independent prediction on any single channel. For movies with surround audio, we further investigate fusing the mono audio and front center channels, which increases accuracy by 5% and 3% absolute (8% and 4% relative) over using only the mono channel or only the front center channel, respectively.
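The fusion step described above can be illustrated with a minimal sketch. This is not the authors' implementation: ROVER was originally proposed for combining aligned word hypotheses from speech recognizers, and the sketch below reduces it to confidence-weighted voting over gender labels. It assumes that segments are already time-aligned across the language channels, that each channel's classifier emits a label and a confidence per segment, and that the function names (`fuse_segment`, `fuse_channels`) and all confidence values are hypothetical.

```python
from collections import defaultdict


def fuse_segment(votes):
    """Fuse one segment's (label, confidence) votes, one vote per channel.

    Assumption: each label's score is the sum of the confidences of the
    channels that voted for it (a simplified, ROVER-style weighted vote).
    """
    scores = defaultdict(float)
    for label, confidence in votes:
        scores[label] += confidence
    # Iterate labels in sorted order so that ties resolve deterministically.
    return max(sorted(scores), key=scores.get)


def fuse_channels(channel_predictions):
    """Fuse per-segment predictions from several time-aligned channels.

    channel_predictions maps a channel name to a list of
    (label, confidence) pairs, one pair per segment.
    """
    lengths = {len(preds) for preds in channel_predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all channels must cover the same segments")
    # zip(*...) groups the votes of all channels segment by segment.
    return [fuse_segment(votes) for votes in zip(*channel_predictions.values())]


if __name__ == "__main__":
    # Made-up confidences for three segments, purely for illustration.
    predictions = {
        "english": [("F", 0.90), ("M", 0.60), ("M", 0.55)],
        "spanish": [("F", 0.80), ("F", 0.70), ("M", 0.60)],
        "french": [("M", 0.40), ("F", 0.65), ("F", 0.50)],
    }
    print(fuse_channels(predictions))
```

In this toy input, the second segment is labeled "M" by the English channel alone, but the combined confidence of the Spanish and French channels outweighs it, which is the kind of correction that cross-channel fusion is meant to provide.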
Similar resources
A Comparative Study of Gender and Age Classification in Speech Signals
Accurate gender classification is useful in speech and speaker recognition as well as speech emotion classification, because a better performance has been reported when separate acoustic models are employed for males and females. Gender classification is also apparent in face recognition, video summarization, human-robot interaction, etc. Although gender classification is rather mature in a...
Practical Considerations for Real-Time Implementation of Speech-Based Gender Detection
This paper describes a detailed analysis and implementation of a robust gender detector for audio stream applications. The implementation, based on mel-cepstral features and a Gaussian mixture model classifier, is designed to maximize gender classification performance in continuous speech. The described detector outperforms other reported systems based on statistically significant numbers of gen...
Binaural cue coding-Part I: psychoacoustic fundamentals and design principles
Binaural Cue Coding (BCC) is a method for multichannel spatial rendering based on one down-mixed audio channel and BCC side information. The BCC side information has a low data rate and it is derived from the multichannel encoder input signal. A natural application of BCC is multichannel audio data rate reduction since only a single down-mixed audio channel needs to be transmitted. An alternati...
Razer Maelstrom Audio Engine
The Evolution of Audio: Unlike video technologies, which have seen a stream of new innovations over the years (color screens, ever-improving resolutions, brighter LCD and plasma screens, various 3D vision enhancements for home and cinema, etc.), stereo has been, since 1931, the predominant technology used for audio reproduction. 1877 – Monophonic sound reproduction is created and the phonograph i...
Multimedia classification of movie shots using low-level and semantic features
Movie shot categorization may be approached by using audio and visual features to infer high-level information about a movie shot. Low-level audio and visual features such as color and MFCC, and mid-level features such as sky and speech detection, have been used in multimedia understanding research. However, integrating all these features in a classifier remains a subject of study. In this p...